What do Deep Networks Like to Hear?¶
This work is based on the paper What do Deep Networks Like to See? by Palacio et al., in which the authors analyse CNN classifiers by fine-tuning an image autoencoder on the gradients of a fixed classifier. To do so, they construct a pipeline in which the input image is first passed through the autoencoder, and the resulting reconstruction is then fed to the fixed classifier to obtain the final class predictions. The weights of the autoencoder are updated by the gradients of the prediction error, which are backpropagated through the frozen image classifier and the reconstructed input image all the way into the autoencoder.
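The pipeline described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the tiny autoencoder and classifier below are placeholders, and only the shapes and the freezing logic matter.

```python
import torch
import torch.nn as nn

class GradientAnalysisPipeline(nn.Module):
    """Autoencoder -> frozen classifier; only the autoencoder is trained."""
    def __init__(self, autoencoder: nn.Module, classifier: nn.Module):
        super().__init__()
        self.autoencoder = autoencoder
        self.classifier = classifier
        # Freeze the classifier: gradients flow through it but never update it.
        for p in self.classifier.parameters():
            p.requires_grad = False
        self.classifier.eval()

    def forward(self, x):
        reconstruction = self.autoencoder(x)
        logits = self.classifier(reconstruction)
        return logits, reconstruction

# Placeholder models, standing in for the real autoencoder and classifier.
autoencoder = nn.Sequential(nn.Conv1d(1, 8, 9, padding=4), nn.ReLU(),
                            nn.Conv1d(8, 1, 9, padding=4))
classifier = nn.Sequential(nn.Flatten(), nn.Linear(16000, 50))  # 50 ESC50 classes
pipeline = GradientAnalysisPipeline(autoencoder, classifier)
optimizer = torch.optim.Adam(pipeline.autoencoder.parameters(), lr=1e-4)

waveform = torch.randn(4, 1, 16000)     # batch of dummy waveforms
labels = torch.randint(0, 50, (4,))
logits, _ = pipeline(waveform)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()   # gradients pass through the frozen classifier into the autoencoder
optimizer.step()
```

Note that the optimizer only receives the autoencoder's parameters, so the classifier stays fixed even though gradients flow through it.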
This work extends this idea of classifier analysis to audio waveforms and acoustic scene classification. To that end, a pre-trained audio waveform autoencoder is fine-tuned to analyse three classifiers with different architectures on the ESC50 dataset. The autoencoder is taken from the ArchiSound GitHub repository and the classifiers are from the EfficientAT GitHub and PaSST GitHub repositories.
The Dataset¶
For the experiments the ESC50 dataset was used. It consists of 2,000 environmental sound recordings, each 5 seconds long and belonging to one of 50 classes. These classes can be further clustered into 5 major categories:
| Animals | Natural soundscapes & water sounds | Human, non-speech sounds | Interior/domestic sounds | Exterior/urban noises |
|---|---|---|---|---|
| Dog | Rain | Crying baby | Door knock | Helicopter |
| Rooster | Sea waves | Sneezing | Mouse click | Chainsaw |
| Pig | Crackling fire | Clapping | Keyboard typing | Siren |
| Cow | Crickets | Breathing | Door, wood creaks | Car horn |
| Frog | Chirping birds | Coughing | Can opening | Engine |
| Cat | Water drops | Footsteps | Washing machine | Train |
| Hen | Wind | Laughing | Vacuum cleaner | Church bells |
| Insects (flying) | Pouring water | Brushing teeth | Clock alarm | Airplane |
| Sheep | Toilet flush | Snoring | Clock tick | Fireworks |
| Crow | Thunderstorm | Drinking, sipping | Glass breaking | Hand saw |
The original dataset uses a sample rate of 44.1 kHz, but this work uses the version provided by the PaSST repository, in which the audio files are resampled to 32 kHz. Furthermore, the dataset is organized in 5 folds, of which folds 2 to 5 are used for training and fold 1 is reserved for validation.
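The fold split can be read directly from the metadata file that ships with ESC50 (`meta/esc50.csv`, with `filename`, `fold` and `target` columns). The path below is an assumption for illustration:

```python
import csv

def split_folds(meta_path="ESC-50/meta/esc50.csv"):
    """Split ESC50 into training (folds 2-5) and validation (fold 1) lists."""
    train, val = [], []
    with open(meta_path, newline="") as f:
        for row in csv.DictReader(f):
            entry = (row["filename"], int(row["target"]))
            if int(row["fold"]) == 1:
                val.append(entry)     # fold 1: validation
            else:
                train.append(entry)   # folds 2-5: training
    return train, val
```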
Classifiers¶
In this work, three classifiers with three different architectures are analysed:
The MobileNet is an efficient CNN, used in this scenario to classify mel spectrograms of audio waveforms. The Dynamic MobileNet (DyMN) modifies the plain architecture by introducing dynamic elements that enable attention in the model. The PaSST model is a transformer-based sound classifier that also operates on mel spectrograms. Both the MobileNet and the Dynamic MobileNet were obtained through knowledge distillation from the PaSST model and then fine-tuned on the ESC50 dataset.
Audio samples¶
Below, a selection of audio samples from the ESC50 dataset is presented, accompanied by their corresponding reconstructions generated by the different autoencoder models.
Autoencoder pretrained on ESC50¶
The following audio samples stem from pre-trained audio autoencoders which were then fine-tuned on the classification gradients of the different classifiers.
Index 8 - crow
Original
MN Autoencoder
DyMN Autoencoder
PaSST Autoencoder
Index 12 - clapping
Original
MN Autoencoder
DyMN Autoencoder
PaSST Autoencoder
Index 31 - water drops
Original
MN Autoencoder
DyMN Autoencoder
PaSST Autoencoder
Index 66 - crackling fire
Original
MN Autoencoder
DyMN Autoencoder
PaSST Autoencoder
Index 86 - insects
Original
MN Autoencoder
DyMN Autoencoder
PaSST Autoencoder
Index 102 - pig
Original
MN Autoencoder
DyMN Autoencoder
PaSST Autoencoder
Autoencoder trained from scratch¶
The following samples stem from autoencoders which were randomly initialized and then fine-tuned on the classification gradients of the corresponding classifiers.